Refine OCR pipeline and update bench gating #26

Open
mapo80 wants to merge 1 commit into main from codex/update-ocr-parameters-for-markitdown

mapo80 (Owner) commented Aug 21, 2025

Summary

  • expand MarkItDownOptions with detailed OCR controls (DPI, PSM, OEM, threads, deskew, color depth)
  • preprocess images with optional scaling, grayscale, DPI metadata and gated deskew
  • wire Tesseract engine to new options and thread limit
  • enhance OcrBench with selective refresh, per-file logging and quality gates
  • refresh markitdownnet OCR artifacts (eng, PSM=6, DPI=300)

Testing

  • dotnet test
  • dotnet run --project tools/OcrBench -- extract --input-dir dataset/validation --out-dir dataset/validation/_ocr --threads 1 --langs eng --psm 6 --refresh markitdownnet
  • dotnet run --project tools/OcrBench -- compare --ocr-dir dataset/validation/_ocr --out-json artifacts/validation/OCR/bench-ocr.json --out-md artifacts/validation/OCR/summary-ocr.md (fails: GLOBAL Token-F1 0.7036, line_F1 0.3879)
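
For readers unfamiliar with the two metrics reported by the compare step, here is a minimal Python sketch of how a token-level and a line-level F1 could be computed and gated. It is an illustration only, not OcrBench's actual implementation: the function names, the whitespace tokenization, the line normalization, and the `min_score` parameter are all assumptions, and the real thresholds used by the bench are not stated in this PR.

```python
from collections import Counter
from typing import Iterable


def _multiset_f1(reference: Iterable[str], hypothesis: Iterable[str]) -> float:
    """F1 over two multisets of items (hypothetical helper, not from OcrBench)."""
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # Counter intersection keeps the minimum count of each shared item.
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def token_f1(reference: str, hypothesis: str) -> float:
    """Token-level F1, assuming plain whitespace tokenization."""
    return _multiset_f1(reference.split(), hypothesis.split())


def line_f1(reference: str, hypothesis: str) -> float:
    """Line-level F1, assuming lines are stripped and empty lines ignored."""
    def lines(text: str) -> list[str]:
        return [ln.strip() for ln in text.splitlines() if ln.strip()]
    return _multiset_f1(lines(reference), lines(hypothesis))


def passes_gate(score: float, min_score: float) -> bool:
    """Quality gate: min_score is a caller-supplied, hypothetical threshold."""
    return score >= min_score
```

Under this definition a single misrecognized token makes the whole line count as a mismatch, which is one plausible reason a line-level score can sit well below the token-level score on the same OCR output.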

https://chatgpt.com/codex/tasks/task_e_68a786d928ac8325bb19b5e3bce40bff
